{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "# NLP Architect - Named Entity Recognition tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "Let's start by importing all the important classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from keras.utils import to_categorical\n",
    "from nlp_architect.data.sequential_tagging import SequentialTaggingDataset\n",
    "from nlp_architect.models.ner_crf import NERCRF\n",
    "from nlp_architect.utils.embedding import get_embedding_matrix, load_word_embeddings\n",
    "from nlp_architect.utils.metrics import get_conll_scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing the data\n",
    "\n",
    "Load a dataset using the `SequentialTaggingDataset` dataset loader.\n",
    "The files should be tagged in `BIO` format and each token should appear in a separate line with its tags separated by tabs. For example: `A B-ENTITY`.  Sentence should be separated by an empty line.\n",
    "\n",
    "Change the train/test paths below to your "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = 'train.txt'\n",
    "test = 'test.txt'\n",
    "\n",
    "sentence_length = 50\n",
    "word_length = 12\n",
    "\n",
    "dataset = SequentialTaggingDataset(train, test,\n",
    "                                   max_sentence_length=sentence_length,\n",
    "                                   max_word_length=word_length,\n",
    "                                   tag_field_no=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the train and test sets - we have 2 inputs and 1 output (word and chars, and entity type for outout)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train, x_char_train, y_train = dataset.train\n",
    "x_test, x_char_test, y_test = dataset.test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert output matrices into 1-hot encoding.\n",
    "Prepare sequence length for train and test data (used by the CRF classification layer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_y_labels = len(dataset.y_labels) + 1\n",
    "y_test = to_categorical(y_test, num_y_labels)\n",
    "y_train = to_categorical(y_train, num_y_labels)\n",
    "train_lengths = np.sum(np.not_equal(x_train, 0), axis=-1).reshape((-1, 1))\n",
    "test_lengths = np.sum(np.not_equal(x_test, 0), axis=-1).reshape((-1, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### External word embedding model\n",
    "\n",
    "We can use pre-train word embedding models with this network. NLP Architect can load GloVe and Fasttext type of models (text files in <word> <vector weights> format).\n",
    "\n",
    "For more information and download links visit official pages of [GloVe](https://nlp.stanford.edu/projects/glove/) and [Fasttext](https://fasttext.cc/docs/en/english-vectors.html).\n",
    "\n",
    "_(The terms and conditions of the data set license apply. Intel does not grant any rights to the data files)_\n",
    "\n",
    "In the example below we point `embedding_path` to a local GloVe model of size 100."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_path = 'glove.6B.100d.txt'\n",
    "embedding_size = 100\n",
    "embedding_model, _ = load_word_embeddings(embedding_path)\n",
    "embedding_mat = get_embedding_matrix(embedding_model, dataset.word_vocab)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating the model\n",
    "The NER model we're going to build is depicted below:\n",
    "\n",
    "![image.png](attachment:image.png)\n",
    "\n",
    "We have 2 input source (words and word characters), a bi-directional LSTM layer and a CRF layer for token classification.\n",
    "\n",
    "There's a class in NLP Architect that implements the above model, called `NERCRF` from `nlp_architect.models.ner_crf`.\n",
    "Build a new model using your chosen parameters, and load the external word embedding vectors into the embedding layer of the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ner_model = NERCRF()\n",
    "ner_model.build(word_length,\n",
    "                num_y_labels,\n",
    "                dataset.word_vocab_size,\n",
    "                dataset.char_vocab_size,\n",
    "                word_embedding_dims=embedding_size)\n",
    "ner_model.load_embedding_weights(embedding_mat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "set batch size and number of epochs and fit the data on the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_size = 10\n",
    "num_epochs = 1\n",
    "\n",
    "ner_model.fit(x=[x_train, x_char_train, train_lengths], y=y_train,\n",
    "              batch_size=batch_size,\n",
    "              epochs=num_epochs,\n",
    "              validation=([x_test, x_char_test, test_lengths], y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "Once the model has trained. Run CONLLEVAL to see how well it performs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = ner_model.predict(x=[x_test, x_char_test, test_lengths], batch_size=1)\n",
    "eval = get_conll_scores(predictions, y_test, {v: k for k, v in dataset.y_labels.vocab.items()})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(eval)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
